Lecture 3.1 - Practicing Confidence Intervals
Interpretation steps
Let’s return to the titanic_survival.csv dataset.
Setting things up
- Create a new variable called survivedindicator; assign it a value of 0 if the passenger did not survive and 1 if the passenger did survive, using the case_when verb (information on how to do that can be found here or here).
- Find the proportion that survived in the entire dataset (you can do this by using the table() command: table(titanic_survival$survivedindicator)).
- Calculate by hand the standard error and 95% confidence interval for the proportion of survivedindicator in a sample of 50 and in a sample of 200, based on the standard deviation of survivedindicator in the full dataset.
- Interpret these confidence intervals.
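The setup steps above can be sketched on a small made-up data frame. This is only an illustration: the column name `survived` and its "yes"/"no" coding are assumptions and may not match the real file, and the interval uses the usual normal approximation, with standard error sqrt(p(1 - p)/n) and a multiplier of 1.96.

```r
library(dplyr)

# Toy stand-in for the real data. The column name `survived` and its
# "yes"/"no" coding are assumptions; adjust them to match your file.
toy <- data.frame(
  survived = c("yes", "no", "no", "yes", "no", "no", "no", "yes", "no", "no")
)

# Recode the text column into a 0/1 indicator with case_when()
toy <- toy %>%
  mutate(survivedindicator = case_when(
    survived == "yes" ~ 1,
    survived == "no"  ~ 0
  ))

table(toy$survivedindicator)       # counts of 0s and 1s
p <- mean(toy$survivedindicator)   # proportion of 1s

# Standard error and 95% confidence interval for a sample of size n
n  <- 50
se <- sqrt(p * (1 - p) / n)
ci <- c(lower = p - 1.96 * se, upper = p + 1.96 * se)
ci
```

Changing `n` to 200 gives the second interval. Note that R's sd() divides by n - 1, so on the full dataset it will differ very slightly from sqrt(p(1 - p)).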
Random matters
Now let’s sample from the dataset. We can sample whether or not they survived by running the following code (note: you need the library dplyr loaded for this code):
library(dplyr)
titanic_draws_50 <- titanic_survival %>%
  slice_sample(n = 50)  # draw 50 rows at random; survivedindicator is one of the columns
- Create one sample of size 50 and one of size 200.
- Find the proportion that survived in each sample and compare it to your expected value. Was it within one standard error of the expected value, or outside it? Why?
- If you sampled many times, how many sample proportions would be inside or outside of your confidence interval?
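One way to make the comparison concrete is sketched below in base R. The 0/1 vector `population` is hypothetical and merely stands in for survivedindicator; the seed is arbitrary and only makes the draw reproducible.

```r
set.seed(1)  # arbitrary seed so the draw is reproducible

# Hypothetical 0/1 population standing in for survivedindicator
population <- rep(c(0, 1), times = c(600, 400))
p <- mean(population)                 # population proportion

n  <- 50
se <- sqrt(p * (1 - p) / n)           # standard error for samples of size n
interval <- c(p - 1.96 * se, p + 1.96 * se)

one_sample <- sample(population, n)   # one draw of n values, without replacement
p_hat <- mean(one_sample)             # sample proportion

p_hat
# TRUE if the sample proportion landed inside the interval
p_hat >= interval[1] & p_hat <= interval[2]
```

Rerunning the last few lines with a different seed shows how the sample proportion moves from draw to draw while the interval around the expected value stays fixed.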
Sampling distributions
Let’s now add the following function to the setup block (copy and paste the entire part and then run your setup code block again):
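prop.multiple.samples <- function(n, numsamples, variable) {
  # One slot per sample; each slot will hold that sample's mean
  meanvector <- numeric(numsamples)
  for (i in seq_len(numsamples)) {
    # Draw n values with replacement and store their mean
    # (for a 0/1 variable, the mean is the proportion of 1s)
    meanvector[i] <- mean(sample(variable, n, replace = TRUE))
  }
  meanvector
}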
This defines a new function in R called prop.multiple.samples(). It takes as its arguments the sample size (n), the number of samples (numsamples), and the variable from which you would like to create a sampling distribution. Once you have created this function, you can use it as follows:
titanic_50_100 <- prop.multiple.samples(50, 100, titanic_survival$survivedindicator)
This takes 100 samples of size 50 from the variable titanic_survival$survivedindicator. In practical terms, this line of code draws 50 people at random (with replacement) from the dataset 100 times and then calculates the proportion of survivors in each sample.
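To see what the function returns, here is a small self-contained sketch. It repeats an equivalent copy of the function so the chunk runs on its own, and applies it to a hypothetical 0/1 vector (`toy_indicator`) rather than the real data.

```r
# Equivalent copy of the worksheet's function, so this chunk runs on its own
prop.multiple.samples <- function(n, numsamples, variable) {
  meanvector <- numeric(numsamples)
  for (i in seq_len(numsamples)) {
    meanvector[i] <- mean(sample(variable, n, replace = TRUE))
  }
  meanvector
}

set.seed(42)  # arbitrary seed for reproducibility
toy_indicator <- rep(c(0, 1), times = c(600, 400))  # hypothetical 0/1 data

toy_50_100 <- prop.multiple.samples(50, 100, toy_indicator)

length(toy_50_100)   # one entry per sample
head(toy_50_100)     # each entry is the proportion of 1s in one sample of 50
range(toy_50_100)    # smallest and largest sample proportion
```

The result is a plain numeric vector, not a data frame, which matters when passing it to plotting functions that expect a data frame.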
Observing the sampling distribution
- Make a histogram using ggplot of the results of titanic_50_100. What does this histogram show? How is it different from the histogram you made from your first sample, titanic_draws_50? Interpret this carefully.
- Calculate the sd() of titanic_50_100. What does this quantity indicate? What calculation should it be equal to? Why?
- If you increased the number of items sampled (n, or the first entry in prop.multiple.samples), what do you think will happen to your histogram? How about the sd() you calculated? How about if you increased numsamples instead?
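A minimal sketch of the plotting and sd() steps follows, using hypothetical 0/1 data in place of the Titanic variable. The bin width of 0.02 is an arbitrary choice, and the final line shows one candidate quantity to compare the simulated spread against, under the normal-approximation assumption.

```r
library(ggplot2)

set.seed(7)  # arbitrary seed for reproducibility
toy_indicator <- rep(c(0, 1), times = c(600, 400))  # hypothetical 0/1 data
n <- 50

# 100 sample proportions, each from a sample of size n drawn with replacement
toy_props <- replicate(100, mean(sample(toy_indicator, n, replace = TRUE)))

# ggplot() needs a data frame, so wrap the vector in one
hist_plot <- ggplot(data.frame(prop = toy_props), aes(x = prop)) +
  geom_histogram(binwidth = 0.02) +
  labs(x = "Sample proportion", y = "Number of samples")
hist_plot

sd(toy_props)            # spread of the simulated sample proportions
p <- mean(toy_indicator)
sqrt(p * (1 - p) / n)    # theoretical standard error, for comparison
```

With only 100 simulated samples the two numbers will be close but not identical.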
Comparing the distribution
- Increase n and numsamples separately (try values like 200 and 500). How do the shape and spread of the histogram change? Did the result match your expectations? Why or why not?
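The comparison can also be run numerically. The sketch below uses hypothetical 0/1 data and a helper named `sd_of_props` (a name invented here, not part of the worksheet) to record the spread of the sample proportions as each argument is varied on its own.

```r
set.seed(3)  # arbitrary seed for reproducibility
toy_indicator <- rep(c(0, 1), times = c(600, 400))  # hypothetical 0/1 data

# Hypothetical helper: sd of `numsamples` sample proportions of size n
sd_of_props <- function(n, numsamples) {
  sd(replicate(numsamples, mean(sample(toy_indicator, n, replace = TRUE))))
}

# Vary n while holding numsamples fixed at 100
by_n <- sapply(c(50, 200, 500), function(n) sd_of_props(n, numsamples = 100))
by_n

# Vary numsamples while holding n fixed at 50
by_numsamples <- sapply(c(100, 200, 500), function(k) sd_of_props(n = 50, numsamples = k))
by_numsamples
```

Comparing the two printed vectors against your histograms is a quick check on whether what you see in the plots is a change in spread or only a smoother picture of the same spread.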
Extra activity - modeling
While it is more common to use a more advanced form of regression (logistic regression) to model an outcome variable that takes the value 0 or 1, in this case please try to use linear regression to make a model that identifies which factors are most important in predicting survival on the Titanic.
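As a starting point, a linear regression on a 0/1 outcome (often called a linear probability model) can be fit with lm(). The sketch below uses made-up passengers; the predictor names `age`, `sex`, and `pclass` are assumptions and may not match the columns in titanic_survival.csv.

```r
set.seed(11)  # arbitrary seed for reproducibility

# Made-up passengers. The column names `age`, `sex`, and `pclass` are
# assumptions and may not match the columns in the real dataset.
toy <- data.frame(
  survivedindicator = rbinom(200, 1, 0.4),
  age    = round(runif(200, 1, 70)),
  sex    = factor(sample(c("female", "male"), 200, replace = TRUE)),
  pclass = factor(sample(1:3, 200, replace = TRUE))
)

# Linear probability model: the 0/1 outcome regressed on the predictors
lpm <- lm(survivedindicator ~ age + sex + pclass, data = toy)

# Each coefficient is the change in the predicted probability of survival
# associated with a one-unit change (or a category switch) in that predictor
summary(lpm)
```

Because the toy outcome is generated at random, its coefficients carry no meaning; the point is only the shape of the call. On the real data, swap in `titanic_survival` and its actual column names.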